Marc Rene Broghammer, University of Constance, broghama@inf.uni-konstanz.de
Juergen Schniertshauer, University of Constance, schniert@inf.uni-konstanz.de
Dr. Peter Bak, University of Constance, Peter.Bak@uni-konstanz.de
Our project makes use of the Konstanz Information Miner (KNIME). KNIME is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models. KNIME is developed by the Chair for Bioinformatics and Information Mining at the University of Konstanz, Germany. It is based on the Eclipse platform and, through its modular design, easily extensible. When desired, custom pipeline nodes can be implemented within hours, extending KNIME to understand and provide first-tier support for highly domain-specific data. KNIME also makes it possible to integrate small code snippets written in Java or R. We created the specific pipeline for this Mini Challenge in an iterative process and estimate our effort at approximately 20 to 30 hours, excluding the time we needed to get used to the tool.
Further analysis was carried out with the SAV framework, written by Stefan Moritz Koch as part of his Master's thesis at the Working Group for Databases, Data Analysis and Visualization at the University of Constance. The framework is currently under development and is geared toward helping analysts find temporal patterns through a combination of automatic and visual methods. We use the TreeMap visualization that is part of the framework to analyze issues that cannot be covered by the simple visualizations integrated into KNIME.
Video:
View the video for Mini Challenge II
ANSWERS:
MC2.1: Analyze the records you have been given to characterize the spread of the disease. You should take into consideration symptoms of the disease, mortality rates, temporal patterns of the onset, peak and recovery of the disease. Health officials hope that whatever tools are developed to analyze this data might be available for the next epidemic outbreak. They are looking for visualization tools that will save them analysis time so they can react quickly.
Figure 1: The analytical pipeline used to solve Mini Challenge 2
Figure 1 shows the analytical pipeline we developed to solve MC2. The solution is based on the Visual Analytics pipeline.
Our goal was to tightly integrate automated data mining, user interaction and visualization. Automated data mining provides the scalability necessary to handle large datasets. The user contributes human perception, flexibility, fine tuning of parameters and pattern recognition. Visualization allows us to combine both methods optimally.
We use a combination of simple visualizations and Treemaps to support human interaction. Simple visualizations enable rapid extraction of information, while the Treemap supplements our approach.
The KNIME pipeline is ideally suited for the visualization and analysis requirements of Task 1, as new diseases can be analyzed with little effort.
2. Data Processing
2.1 Data Integration
The challenge data is organized into two CSV files for each country. One file contains the hospitalization data for all patients. The other file contains the data associated with the deaths of patients. The first step is to integrate the data by joining the two files for each country and combining the results. Because the 'patient-ID' in the original dataset is only unique within each country, a new 'patient-ID' key is required. The attribute 'dies' is also derived.
Table 1: Derived Table Format
patient-id | hosp. date | gender | age | syndrome | country | death date | dies
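The integration step and the derived table format above could be sketched in Python roughly as follows. This is not the KNIME workflow itself. The input column names, the country-prefixed key format and the sample rows are assumptions made purely for illustration.

```python
def integrate(hospital_rows, death_rows, country):
    """Join one country's hospitalization and death records.

    Both inputs are lists of dicts (for example from csv.DictReader).
    The column names used here are assumed, not taken from the challenge files.
    """
    # Look up death dates by the country-local patient id.
    death_dates = {row["patient-id"]: row["death date"] for row in death_rows}
    merged = []
    for row in hospital_rows:
        local_id = row["patient-id"]
        death = death_dates.get(local_id)
        merged.append({
            # Prefix with the country so the key is unique across countries.
            "patient-id": f"{country}-{local_id}",
            "hosp. date": row["hosp. date"],
            "gender": row["gender"],
            "age": row["age"],
            "syndrome": row["syndrome"],
            "country": country,
            "death date": death,
            # Derived attribute: True if a death record exists.
            "dies": death is not None,
        })
    return merged


# Hypothetical sample rows for a single country.
hospital = [
    {"patient-id": "1", "hosp. date": "2009-04-20", "gender": "F",
     "age": "30", "syndrome": "vomiting"},
    {"patient-id": "2", "hosp. date": "2009-04-21", "gender": "M",
     "age": "45", "syndrome": "back pain"},
]
deaths = [{"patient-id": "1", "death date": "2009-04-28"}]
table = integrate(hospital, deaths, "Lebanon")
```

Combining the results for all countries then amounts to concatenating the lists returned for each country.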
2.1.1 First temporal analysis
We use this data for a first analysis of the difference between hospitalization date and date of death. It shows that almost every patient who died had been in hospital for 8 days.
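A minimal sketch of this first temporal analysis, assuming the derived table holds the two dates as `datetime.date` values; the sample records are invented:

```python
from collections import Counter
from datetime import date


def stay_lengths(records):
    """Count deceased patients by days between hospitalization and death."""
    counts = Counter()
    for rec in records:
        if rec["dies"]:
            days = (rec["death date"] - rec["hosp. date"]).days
            counts[days] += 1
    return counts


# Hypothetical records in the derived table format (only relevant fields).
records = [
    {"dies": True, "hosp. date": date(2009, 5, 1), "death date": date(2009, 5, 9)},
    {"dies": False, "hosp. date": date(2009, 5, 2), "death date": None},
    {"dies": True, "hosp. date": date(2009, 5, 3), "death date": date(2009, 5, 11)},
]
lengths = stay_lengths(records)
```

Plotting the resulting counts as a histogram would make a dominant stay length directly visible.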
2.2 Data Processing
Our data cleaning and processing phase began with extracting the most frequent abbreviations found in the 'syndrome' column, which we defined as strings of length <= 3. We then visualized the frequencies of the abbreviations in a histogram and wrote a replacement list for the frequent ones. Next, the data has to be transformed into a table with a uniform separation between the single symptoms in a symptoms string. We standardise the symptoms to a comma-separated form.
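The cleaning steps above can be illustrated with a short Python sketch. The set of separator characters and the entries of the replacement list are assumptions, since neither is spelled out here; the actual replacement list was written by hand after inspecting the histogram.

```python
import re
from collections import Counter

# Assumed separator characters between symptoms in the raw strings.
SEPARATORS = re.compile(r"\s*[,;/&+]\s*")


def abbreviation_counts(syndromes, max_len=3):
    """Count tokens of length <= max_len as abbreviation candidates."""
    counts = Counter()
    for text in syndromes:
        for token in re.findall(r"[a-z]+", text.lower()):
            if len(token) <= max_len:
                counts[token] += 1
    return counts


def normalize(syndrome, replacements):
    """Expand abbreviations and return a comma-separated symptom string."""
    symptoms = []
    for part in SEPARATORS.split(syndrome.lower()):
        words = [replacements.get(word, word) for word in part.split()]
        if words:  # skip empty fragments between adjacent separators
            symptoms.append(" ".join(words))
    return ", ".join(symptoms)


# Hypothetical replacement list entry and raw strings.
replacements = {"abd": "abdominal"}
candidates = abbreviation_counts(["ABD pain", "abd pain; vom"])
cleaned = normalize("ABD pain; vomiting/diarrhea", replacements)
```

Short ordinary words also pass the length filter, which is one reason a manual look at the frequencies is still needed before writing the replacement list.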
After this step, the standardised symptoms are visualized once again, and we see that no symptom has a death count between 392 and 3275. In further analysis, we therefore consider only symptoms with more than 1000 deaths as important. We scan the list for equivalent notations of the same symptom and replace them. We now have our final list of symptoms.
3. Analysis
3.1 The symptoms
Using a time series plot, we were able to detect minor symptoms from their correlation with the primary symptoms over time. In summary, we conclude that the following symptoms are caused by the virus:
Major symptoms:
Vomiting
Diarrhea
Nose Bleeding
Abdominal Pain
Back Pain
Minor symptoms:
Conjunctivitis Red
Encephalitis
Facial Swelling
Hearing Loss
Proteinuria
Tremor
The minor symptoms did not seem important initially, but the lower curves in Figure 3 clearly follow the temporal pattern of the main symptoms, which are visualized by the upper curve.
The following analysis includes all relevant combinations we explored.
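We identified the minor symptoms visually from the time series plot. As a numeric complement, one could compute a Pearson correlation between the daily counts of a candidate symptom and those of the main symptoms. The sketch below is an assumed alternative to the visual check, and the daily counts are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily counts: a main symptom and a minor symptom candidate.
main_counts = [100, 200, 400, 300, 150]
minor_counts = [10, 20, 40, 30, 15]
r = pearson(main_counts, minor_counts)
```

A value close to 1 would indicate that the candidate follows the same temporal pattern as the main symptoms.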
3.2 Mortality Rates
A quick look at an ordered histogram of the mortality rates shows some outliers with a 100% mortality rate, relevant symptoms with a mortality rate of approximately 10%, and many noise symptoms with 1% or less. Figure 4 shows the change of mortality rates over time. We expected a simple correlation between symptoms and mortality rate, but the TreeMap reveals that mortality rates are not constant for a given syndrome. We can also see that mortality rates do not follow a clear pattern dependent on the symptom or time. Moreover, there is an anomaly: mortality rates peak towards the end of the recovery period in almost every country.
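To make the quantity behind these observations concrete, the following sketch computes a mortality rate per symptom and calendar week from records in the derived table format. Grouping by ISO week and by hospitalization date is an assumption chosen for illustration; the sample records are invented.

```python
from collections import defaultdict
from datetime import date


def mortality_rates(records):
    """Return {(symptom, iso_week): deaths / cases} by hospitalization week."""
    cases = defaultdict(int)
    deaths = defaultdict(int)
    for rec in records:
        week = rec["hosp. date"].isocalendar()[1]
        # Symptoms are assumed to be in the standardised comma-separated form.
        for symptom in rec["syndrome"].split(", "):
            key = (symptom, week)
            cases[key] += 1
            if rec["dies"]:
                deaths[key] += 1
    return {key: deaths[key] / cases[key] for key in cases}


# Hypothetical records, all hospitalized on the same day.
day = date(2009, 5, 4)
records = [
    {"hosp. date": day, "syndrome": "vomiting, back pain", "dies": True},
    {"hosp. date": day, "syndrome": "vomiting", "dies": False},
    {"hosp. date": day, "syndrome": "vomiting", "dies": False},
    {"hosp. date": day, "syndrome": "vomiting", "dies": False},
]
rates = mortality_rates(records)
```

Each resulting value could then be mapped to the colour of one TreeMap cell or to one point of a time series.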
3.3 Temporal Patterns and spread of the disease
The frequencies of all relevant symptoms begin to rise around April 20, peak around May 16, and level off around June 13.
Based on the onset and the peaks of the disease, we can characterize the spread in the following way:
MC2.2: Compare the outbreak across cities. Factors to consider include timing of outbreaks, numbers of people infected and recovery ability of the individual cities. Identify any anomalies you found.
Figure 1: The analytical pipeline used to solve Mini Challenge 2
Figure 1 shows the analytical pipeline we developed to solve MC2. This is the same pipeline we used to solve Task 1. The solution is based on the Visual Analytics pipeline.
Our goal was to tightly integrate automated data mining, user interaction and visualization. Automated data mining provides the scalability necessary to handle large datasets. The user contributes human perception, flexibility, fine tuning of parameters and pattern recognition. Visualization allows us to combine both methods optimally.
We use a combination of simple visualizations and Treemaps to support human interaction. Simple visualizations enable rapid extraction of information, while the Treemap supplements our approach.
The KNIME pipeline is ideally suited for the visualization and analysis requirements of Task 1, as new diseases can be analyzed with little effort.
2. Data Processing
2.1 Data Integration
The challenge data is organized into two CSV files for each country. One file contains the hospitalization data for all patients. The other file contains the data associated with the deaths of patients. The first step is to integrate the data by joining the two files for each country and combining the results. Because the 'patient-ID' in the original dataset is only unique within each country, a new 'patient-ID' key is required. The attribute 'dies' is also derived.
Table 1: Derived Table Format
patient-id | hosp. date | gender | age | syndrome | country | death date | dies
2.1.1 First temporal analysis
We use this data for a first analysis of the difference between hospitalization date and date of death. It shows that almost every patient who died had been in hospital for 8 days.
2.2 Data Processing
Our data cleaning and processing phase began with extracting the most frequent abbreviations found in the 'syndrome' column, which we defined as strings of length <= 3. We then visualized the frequencies of the abbreviations in a histogram and wrote a replacement list for the frequent ones. Next, the data has to be transformed into a table with a uniform separation between the single symptoms in a symptoms string. We standardise the symptoms to a comma-separated form.
After this step, the standardised symptoms are visualized once again, and here too we see a gap: no symptom has a total death count between 392 and 3275. In further analysis, we therefore consider only symptoms with more than 1000 deaths as important. We scan the list for equivalent notations of the same symptom and replace them. We now have our final list of symptoms.
3 Analysis
3.1 Countries to consider
3.2 Temporal Patterns between countries
We now focus our attention on the visualization in Figure 3. At first glance, it seems as if Lebanon is the origin of the disease because it starts at a high level of hospitalized people. On the other hand, the data from Lebanon does not show a constant increase at this level of temporal detail. Lebanon also starts at a comparatively high rate of infection, with approximately 10% of its maximum number of hospitalized people. At a lower level of smoothing we can see that the Lebanese data is unusually "noisy" and oscillates around 10% of its maximum for the first days before it begins to rise.
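Statements such as "10% of the maximum" presuppose that each country's curve is scaled to its own peak so that countries of different size become comparable. A minimal sketch of such a scaling, with invented daily counts, might look like this:

```python
def scale_to_peak(counts):
    """Scale a series of daily counts so that its maximum becomes 1.0."""
    if not counts:
        return []
    peak = max(counts)
    if peak == 0:
        return [0.0 for _ in counts]
    return [count / peak for count in counts]


# Hypothetical daily hospitalization counts for one country.
scaled = scale_to_peak([5, 50, 25])
```

With this scaling, a country that starts at a value near 0.1 stands out against countries whose curves start near zero.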
3.3 An Anomaly: Countries exhibiting an uneven growth in death counts
Examining the data at a lower level of temporal detail reveals two broad differences in the way the disease developed. As Figure 5 shows, in Colombia, Lebanon, Iran, Saudi Arabia and Venezuela there is a kind of break during the onset. In all other countries there is a steady growth in death counts, even when viewed as a three-day moving average. This phase of lower death counts can also be seen in the mortality rates (see Task 1, Figure 4), which show a small valley during this period.
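For reference, a three-day moving average of daily death counts can be computed as in the sketch below. The trailing-window definition and the sample counts are assumptions; a centered window would serve equally well.

```python
def moving_average(values, window=3):
    """Return the averages of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    for start in range(len(values) - window + 1):
        averages.append(sum(values[start:start + window]) / window)
    return averages


# Hypothetical daily death counts with a dip during the onset.
smoothed = moving_average([3, 6, 9, 3, 3, 12])
```

A break of the kind described above would show up as a stretch where the smoothed values stop rising or fall before the growth resumes.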